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PROBABILITY DENSITY 
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Australian National University 

We discuss properties of two methods for ascribing probabilities 
to the shape of a probability distribution. One is based on the idea of 
counting the number of modes of a bootstrap version of a standard 
kernel density estimator. We argue that the simplest form of that 
method suffers from the same difficulties that inhibit level accuracy 
of Silverman’s bandwidth-based test for modality: the conditional 
distribution of the bootstrap form of a density estimator is not a 
good approximation to the actual distribution of the estimator. This 
difficulty is less pronounced if the density estimator is oversmoothed, 
but the problem of selecting the extent of oversmoothing is inher¬ 
ently difficult. It is shown that the optimal bandwidth, in the sense 
of producing optimally high sensitivity, depends on the widths of 
putative bumps in the unknown density and is exactly as difficult 
to determine as those bumps are to detect. We also develop a sec¬ 
ond approach to ascribing a probability to shape, using Muller and 
Sawitzki’s notion of excess mass. In contrast to the context just dis¬ 
cussed, it is shown that the bootstrap distribution of empirical excess 
mass is a relatively good approximation to its true distribution. This 
leads to empirical approximations to the likelihoods of different levels 
of “modal sharpness,” or “delineation,” of modes of a density. The 
technique is illustrated numerically. 

1. Introduction. Assigning a probability, or a measure of likelihood, to 
a quantity determined by an infinite number of unknown parameters is an 
intrinsically difficult problem. This is particularly the case when definition 
of the function requires a certain level of smoothing, for example in the case 
of a probability density. It has recently been proposed [Efron and Tibshirani 
(1998)] that relative likelihoods of the numbers of modes of a density might 
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be calculated by, in effect, counting the numbers of modes of bootstrap 
versions of a kernel density estimator. This can be viewed as a development 
of Silverman’s (1981) bootstrap method for testing for the number of modes 
of a distribution; Silverman adjusted the bandwidth of the estimator until 
the mode count agreed with that specified by the null hypothesis. 

We argue that such bootstrap likelihoods do not converge in probability 
and in particular do not converge to the “truth” in the standard frequentist 
sense, unless the bandwidth is chosen an order of magnitude larger than 
would be appropriate for standard kernel density estimation. Using subsam¬ 
pling methods does not overcome this difficulty; if anything, those techniques 
make matters a little worse. The difficulties are related to the known level 
inconsistency of Silverman’s (1981) test for the number of modes. Indeed, 
both problems are rooted in the fact that the bootstrap distribution of a 
kernel density estimator is not a good approximation to the unconditional 
distribution of the estimator, if the bandwidth is of its usual pointwise op¬ 
timal size. 

If the bandwidth is allowed to take larger than usual values, then these 
problems recede. However, the difficulty then arises of determining how large 
the bandwidth should be. We show that this problem is essentially insoluble; 
the size of the bandwidth depends on the widths of small potential modes, 
the very existence of which one is trying to determine. 

There is, however, a second, related class of problems, where we may ex¬ 
ploit the fact that (under the assumption of a given number of modes) the 
“modal sharpness,” or extent of delineation of the modes of a density, can 
be accurately estimated in terms of empirical excess mass. Most important, 
in contrast to problems related to the likelihood of the number of modes, 
the distribution of empirical excess mass can be accurately approximated 
using the standard bootstrap, without requiring choice of a smoothing pa¬ 
rameter. In this way a set of graphs of constrained density estimates can 
be constructed, having excess masses that correspond to quantiles of the 
estimated distribution of excess mass for a given number (fc, say) of modes, 
and actually having k modes. A value of k can be determined by testing, or 
sets of graphs can be constructed for different numbers of modes. 

We discuss these two approaches as much because they contrast as be¬ 
cause they are similar. The first, density estimator-based technique cannot 
be interpreted in frequentist terms, and indeed Donoho’s (1988) results es¬ 
sentially imply that there is not a meaningful way of empirically assessing 
the likelihood that a density has m modes. The second method sidesteps 
these difficulties by eschewing the problem of computing a likelihood for 
modality, and instead focuses on measuring the “pointiness” of the density’s 
peaks and troughs. Since this approach has a conventional interpretation 
in frequentist terms, then it is necessarily different from the first, but the 
two are plainly connected; the number of modes of a probability density is 
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closely related to its excess mass, not least through the fact that empirical 
approximations to the latter are used to test hypotheses about the former. 

Another similarity is that both methods claim to attribute a probability to 
the shape of a density. For the first method the probability is the likelihood 
that the density has a given number of modes, while for the second it is the 
coverage probability of a confidence interval for excess mass. 

Section 2.1 will discuss issues of bandwidth choice in the first class of 
problems, where the likelihood of modality is approximated by mode count¬ 
ing. The second problem class, where shape is described in terms of excess 
mass, will be discussed in Section 2.2. Both accounts rely critically on theo¬ 
retical properties, which will be described in Section 3. Efron and Tibshirani 
(1998) have already carefully worked through numerical examples in the first 
problem class, and so in the numerical work in Section 2.2 we shall confine 
attention to the second class. 

2. Methodology and general properties. 

2.1. Counting modes of a kernel estimator. Efron and Tibshirani (1998) 
developed an engaging and particularly original approach to solving prob¬ 
lems that are more general than that considered in the present paper. Efron 
and Tibshirani’s method allows them to ascribe a “probability” to the event 
that a density / is bimodal, by in effect counting the number of modes in 
the bootstrap form of an appropriately constructed kernel estimator, for 
example, 

1=1 

where K is a known probability density, h denotes the bandwidth and 
X = {Xi ,..., X n } is a random sample drawn from the distribution with den¬ 
sity /. The arguments Efron and Tibshirani use are, of necessity because 
the range of problems they treat is so broad, heuristic rather than rigorously 
mathematical. We shall argue that, in the context of modality of densities, 
their definition of the amount of probability attributable to different density 
shapes is not interpretable as a probability in the usual frequentist sense. 

Related issues were addressed by Donoho (1988), who demonstrated on 
essentially topological grounds that the probability that a density has at 
least k modes is definable, whereas the chance that the density has exactly 
k modes is not. We should mention too that Efron and Tibshirani’s method 
is somewhat more sophisticated than the counting approach we shall discuss 
below, for example through being founded on Gaussian-based transforma¬ 
tions of the bootstrap distributions of numbers of counts. However, since 
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the large sample distributions of those counts are not approximately Nor¬ 
mally distributed (see, e.g., Theorem 3.1), it is not difficult to show that our 
conclusions apply to the more complex method. 

The method that Efron and Tibshirani (1998) suggest using for band¬ 
width selection, that is, ten-fold cross-validation, produces (as it is designed 
to) a bandwidth of an order that is asymptotically optimal for pointwise ac¬ 
curacy of the estimator. In particular, the bandwidth is of size n~ 1//5 , where 
n denotes sample size. This will prove important in the first part of our 
discussion, although later we shall consider larger bandwidths. 

We shall show in Theorem 3.1 that for such a bandwidth, the bootstrap 
distribution of the number of modes of the bootstrapped density estimator 
converges in distribution but not in probability; the latter, not the former, 
is the usual sense in which, for practical reasons, one wishes a bootstrap 
quantity to converge. Moreover, the in-distribution limit does not accurately 
reflect the number of modes of the sampled distribution. In particular, even if 
the sampled distribution is strictly unimodal, the weak limit of the bootstrap 
likelihood is nondegenerately supported on the set of all strictly positive 
integers. 

Naturally one seeks a way of overcoming these difficulties. The method 
of subsampling, or the “m out of n bootstrap” as it is sometimes called, 
has a good reputation for remedying convergence problems in a wide range 
of applications of the bootstrap. See, for example, Bickel and Ren (1996), 
Lee (1999) and Politis, Romano and Wolf (1999). In Theorem 3.2 we shall 
show, however, that in the context of estimating the number of modes of /, 
subsampling actually tends to impair performance of the bootstrap when the 
bandwidth is of size n -1 / 5 . It results in the likelihood being approximated 
by an indicator function, and so the bootstrap estimate of the probability 
that the sampled density has k modes is well approximated by a random 
variable that takes only the values 0 and 1. This indicator variable does 
not converge in probability. It does converge in distribution, but not to the 
deterministic indicator of the number of modes of the true density. 

The landscape changes markedly when a larger order of bandwidth is 
employed, however. If modes and local minima of the density are “clearly 
defined,” in the sense that the curvature of the density does not vanish at 
those turning points and the density has no shoulders, and if the bandwidth 
converges to 0 at a strictly slower rate than n -1 / 5 , then the probability that 
the density estimator has the same number of modes as the true density 
converges to 1 as n —* oo. This result, and those discussed earlier, are valid 
provided we avoid spurious small modes in the tails that arise from data 
sparseness. This problem is commonly addressed as part of kernel-based 
inference for the number of modes; see, for example, Fisher, Mammen and 
Marron (1994) and Hall and York (2001). [A density / has a “shoulder” at 
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a point x if both f'{x) and f"{x ) vanish and x is not a turning point. To 
remove the latter possibility it is usual to assume f"{x) ^ 0 if f'{x ) = 0.] 

In reality, however, the issues are more complex than this simple asymp¬ 
totic account suggests. The most important problems involving determina¬ 
tion of the number of modes are arguably those where the modes are not 
“clearly defined” in the context discussed immediately above. Examples in¬ 
clude problems where it is difficult to distinguish between a small mode 
and a shoulder. To some extent these instances too can be satisfactorily 
addressed by simply counting the number of modes of a kernel density esti¬ 
mator, as proposed by Efron and Tibshirani (1998), although now the choice 
of bandwidth becomes a more critical issue. Theorem 3.3 will show that the 
bandwidth should now be at least an order of magnitude larger than n -1 / 7 ; 
otherwise, spurious additional modes will be introduced in the region of a 
shoulder, if the density should have a shoulder rather than a small mode. 
Therefore, the bandwidth for the density estimator that will enable a bump 
to be detected must be strictly narrower than the bump for which we are 
looking. 

Moreover, the bandwidth should not be too large, or we shall smooth 
the bump into the shoulder and miss it altogether. For example, suppose 
a small bump is constructed above the shoulder, of width hi and with its 
height chosen so that the density estimator continues to have three bounded 
derivatives. Take h\ = h\{n) to converge to 0 as n —> oc, so that the problem 
becomes more complex as more information becomes available. In order to 
correctly distinguish the bump as a mode, by counting the number of modes 
of a kernel density estimator, the bandwidth for the latter must converge to 
zero at a rate that is strictly faster than h \. These results, which are made 
concise in Theorem 3.4, also hold if we count the number of modes of the 
bootstrap form of the density estimator. 

2.2. Excess mass as a descriptor of density shape. The notion of ex¬ 
cess mass was introduced by Muller and Sawitzki (1991), and has been dis¬ 
cussed extensively; see, for example, Polonik (1995, 1998), Gezeck, Fischer 
and Tirnmer (1997), Cheng and Hall (1998), Chaudhuri and Marron (1999, 
2000), Polonik and Yao (2000) and Fisher and Marron (2001). It is closely 
related to Hartigan and Hartigan’s (1985) notion of a “dip” in a distribution, 
and in fact Hartigan and Hartigan’s dip test for unimodality is equivalent, in 
one dimension, to the excess mass test. Either approach can be thought of as 
being based on the “taut string” method for constructing an empirical dis¬ 
tribution that is constrained to be unimodal. That technique has a range of 
applications to other problems, including monotone and convex approxima¬ 
tion [e.g., Leurgans (1982)], nonparametric regression more generally [e.g., 
Marnmen and van de Geer (1997) and Davies and Kovac (2001)] and data 
exploration [Davies (1995)]. 
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Excess mass of order m > 1, and the corresponding excess mass difference, 
are defined, respectively, by 


( 2 . 2 ) 


E m (X)= sup ^{E(Lj) — A||Lj||}, 

^m = sup{F m (A) — E m _|(A)}, 

A>0 


in which the first supremum is taken over all sequences L i,..., L m of disjoint 
intervals, and ||L|| denotes the length of L. The empirical form, A m , of A m is 
obtained by replacing E m ( A) and F m _i(A) at (2.2) by E m { A) and F m _i(A), 
respectively, where E m ( A) is dehned as at (2.2) but with F replaced by the 
empirical distribution function F based on the dataset X. Properties of A m 
directly reflect those of A m , not least through the fact that A m is consistent 
for A m as n —> oo. 

To appreciate the connection between A m and the shape of the density /, 
observe that when m = 2 and / is bimodal, A m equals the least amount of 
mass that needs to be removed from one of the modes, and placed into the 
trough between them, in order to render / unimodal. In particular, A 2 = 0 
if and only if / is unimodal. For a general / and for rri > 2, A m = 0 if/ 
has no more than m — 1 modes, although the converse is not generally true 
for m > 3. For instance, A 3 = 0 for a strictly trimodal density if and only 
if the height of either of the outer modes does not exceed the height of the 
local minimum between the other two modes. One can reasonably argue that 
in this case the lowest mode is insubstantial relative to the other two, and 
that “A 3 = 0 if and only if the density / has no more than two relatively 
substantial modes.” Analogous interpretations are valid for m > 4. 

The fact that the excess mass statistic does not exactly relate to the 
number of modes (not least because “insubstantial modes” do not directly 
influence the statistic) means that our approach to ascribing a probabil¬ 
ity to density shape is quite different from the mode-focused method sug¬ 
gested by Efron and Tibshirani (1998). Our approach is clearly influenced 
by modality, but is far from being driven by it. It measures the shape of 
a distribution using information about mode “strength,” and in some ways 
pays scant attention to the number of modes. It is partially linked to Efron 
and Tibshirani’s (1998) approach through work of Chaudhuri and Marron 
(1999, 2000), which emphasizes mode counts but nevertheless assesses the 
strengths of putative modes. 

Focusing on the case A m = 0 addresses only one example of the ways 
in which A m reflects the shape of /. More generally, the fact that E m ( A) 
represents the maximum deviation of / from a composition of m uniform 
distributions of height A implies that as A m increases at least some of the 
modes of / become more pronounced. To gain insight into this property, 
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consider the case where / is the density of a mixture of p > m Normal 
N{ni,af) populations, with distinct fixed means ni,...,p p and respective 
nonvanishing, fixed mixing proportions 7ri,...,7r p . The supremum of A m , 
over all such densities, equals the sum of the p — m + 1 smallest values of tt z . 
and is attained by letting the corresponding p — m + 1 values of <7j decrease 
to zero. In particular, the modes corresponding to these distributions in the 
mixture become infinitely sharp spikes. Furthermore, if the mixture density 
has p modes then the maximum value of A m is attained only by letting 
the p — m + 1 values of er* (corresponding to the p — m + 1 smallest 7Tj’s) 
converge to zero. 

The simplest case in the Normal mixture example is that where m = p = 2 
and 0 < 7Ti < 7T2 < 1. There, A 2 > 0 if and only if 01 and <72 are chosen so that 
the Normal mixture is bimodal, and or —> 0 as A 2 increases to its maximum 
value, 7Ti. (The limit A 2 —> 7ir can be attained with <72 fixed and o\ —* 0, 
and also with or, <72 —> 0 together.) 

Properties such as those discussed in the three previous paragraphs argue 
that for fixed m, relatively large values of A m are associated with densities 
/ that have more than m — 1 “substantive” modes, and with all but m — 1 
of the modes being relatively sharp. 

2.3. Imposing constraints on excess mass. If we were to construct / in 
such a way that A m (f) = A m then we would be allowing the estimator to 
reflect the actual empirical level of “modal sharpness.” Note particularly that 
calculation of A m does not involve any smoothing, whereas / does require a 
smoothing parameter. Conceptually, computing / subject to A m (f) = A m is 
similar to constructing a density estimator subject to one or more moments 
of the distribution with density / being equal to the corresponding empirical 
moments for the dataset X. Advantages of the latter procedure have been 
discussed by Jones (1991) and Hall and Presnell (1999), for example. In 
practice, however, the task is significantly more difficult when the constraint 
is in terms of excess mass, not least because A m (f) is a highly nonlinear 
function of /. 

We can be more bold than to ask simply that A m (f) = A m . The dis¬ 
tribution of A m may be approximated using bootstrap methods (see Theo¬ 
rem 3.5), and estimates of the quantiles of the distribution may be computed. 
In this way we may construct versions of / under the constraint that its ex¬ 
cess mass equals any given quantile, thereby computing density estimates 
that reflect the sharpness of the true density in a median sense or in the 
sense of any given probability for excess mass. 

This procedure can be implemented using data-sharpening methods [Choi 
and Hall (1999) and Braun and Hall (2004)], to impose constraints on esti¬ 
mator shape. The method produces a new estimator fy, computed as was 
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/ = fx but from a sharpened dataset y, with excess mass A, say. [We 
would usually choose A to be an estimator of a quantile of the distribution 
of A m(fy)-} The method starts with a density estimator, /, which could be 
either fx or fz, the latter being another version of fx , this time constructed 
for another sharpened sample Z. 

In the latter case, Z might be deliberately constructed so that fz has a 
different shape from fx- Discussion of the principle of data sharpening, and 
of reasons why an intermediate dataset, Z, might be generated from X prior 
to using data sharpening to impose a constraint on excess mass, is given in 
Appendix A. An algorithm for data sharpening is presented in Appendix B. 

The number of modes of fy will be determined partly by the number of 
modes of /, and partly by the numerical value chosen for A. For example, 
if / is unimodal, implying that A 2 (/) = 0, but we take m = 2 and A > 0, 
then fy will have a second mode, generally becoming more pronounced as 
A increases. If / is trimodal then constraining A 3 (fy) to equal A < A 3 (/), 
and steadily reducing A to zero, may reduce the number of modes to two or 
may simply reduce the height of one of the two outer modes to the height of 
the local minimum between the other two modes. The outcome here depends 
on /. If / is trimodal, and if one of the modes is “insubstantial” (in the sense 
of Section 2.2), then constraining A 3 (fy) to equal A > A 3 (/) will often make 
that mode more pronounced; the mode will not be smoothed away. (More 
generally, all the modes to which we refer above are modes in the usual 
sense, i.e., local maxima of the density. They can be either “substantial” or 
“insubstantial” from the viewpoint of excess mass.) 

There are potential alternative approaches, although they are difficult to 
implement in practice. One might consider using a single, fixed bandwidth, 
and vary it to ensure a given value of empirical excess mass. However, this 
approach is so strongly influenced by data in the tails of the distribution that 
it is often impractical. For example, if / is a Normal density, and the desired 
number of modes equals one, then the bandwidth must diverge to infinity 
with sample size, at rate at least (logn) 1 / 2 , in order to ensure that empirical 
excess mass difference equals zero (equivalently, that the density estimator is 
unimodal). An alternative technique would be to use a bandwidth that varies 
with location, but that approach too is strongly influenced by outlying data 
and is difficult to use to estimate densities with a given number of modes 
when that number exceeds one. Moreover, even under the constraint of a 
single mode it is difficult to select a variable bandwidth that produces a 
given value of excess mass. 

2.4. Real-data illustration of constraints on excess mass. We illustrate 
application of our data-sharpening method to the chondrite dataset of Good 
and Gaskins (1972, 1980). There is evidence [e.g., Leonard (1978) and Sil¬ 
verman (1981)] that these data come from a distribution with at least two 
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Fig. 1. Density estimates calculated from sharpened versions of the chondrite data, where 
the extent of sharpening is such as to ensure the excess mass of the density estimate equals 
the a-level quantile of the bootstrap distribution of the excess mass statistic. In each panel 
the dotted curve depicts the conventional kernel estimator , using the same Sheather-Jones 
bandwidth as the sharpened versions shown by the unbroken curve. The values of a are 
0.005, 0.01, 0.05, 0.95, 0.99 and 0.995, and correspond to the curves shown in panels 
(n)-(i), respectively. 


modes, and likely no more than two modes. Good and Gaskins (1972), Si- 
rnonoff (1983) and Minnotte, Marchette and Wegman (1998) note that the 
chondrite dataset may contain evidence of three modes, while Miiller and 
Sawitzki (1991) are inconclusive in this regard. Evidence for the third mode 
is based on just three data points, and so is not strong; see Silverman’s 
contribution to the discussion of Leonard (1978). A kernel density estimate 
based on the chondrite data, with bandwidth chosen by the Sheather and 
Jones (1991) plug-in rule, is shown by the dotted lines in the panels of Fig¬ 
ure 1 and does in fact have just two modes. If it had three modes, say, we 
would use data sharpening to reduce one of the modes to a shoulder, so that 
the final density estimate had just two modes. Then in subsequent steps of 
our algorithm we would replace the real dataset by its sharpened form. 

Fixing the bandwidth, and using bootstrap simulation, we estimated quan¬ 
tiles of the distribution of A 2 for levels a = 0.005, 0.01, 0.05, 0.95, 0.99 and 0.995. 
The corresponding density estimates are depicted in Figure 1. They were 
computed using the algorithm given in Appendix B. Estimates for low val¬ 
ues of a are relatively close to being unimodal, while those for a close to 1 
have pronounced modes and antimodes. 
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We also applied our technique to the geyser dataset of Weisberg (1985) 
and Scott (1992), which consists of 107 eruption durations for the Old Faith¬ 
ful geyser. Tests for multimodality based on the excess mass statistic, cali¬ 
brated in a variety of ways, argue strongly that the sampled distribution has 
at least two modes; see Muller and Sawitzki (1991). There is no evidence 
of more than two modes, and in particular a kernel density estimate con¬ 
structed using the Sheather and Jones (1991) plug-in bandwidth shows two 
pronounced modes and not even a suggestion of a shoulder. Construction of 
the quantile curve estimates gives results similar to those in Figure 1. 

3. Theoretical properties of shape probabilities. 

3.1. Shape probabilities for fixed densities. Let / be as at ( 2 . 1 ), and de¬ 
note by f* the standard bootstrap form of /, computed from a resample X* 
derived by sampling randomly with replacement from X. Write N(n ) and 
N*(n ) for the numbers of modes (i.e., local maxima) of / and /*, respec¬ 
tively. Among other results we shall show that if n 1//5 h —»■ Cq > 0 as n —> oo 
then IV (to) and iV* (to) converge in distribution. The limit is degenerate if 
and only if Co = oo; in this case it is concentrated at the atom 1 . 

Next we describe the limiting distributions of N(n) and N*(n) in the 
“standard” case, where n x ^h —> Co G (0, oo). Let W and W* denote inde¬ 
pendent standard Brownian bridges, let xq be the mode of /, and assuming 
f(x o) > 0 and f'\x o) < 0 , put 

t(y) = f(x 0 ) 1/2 J K"(u)W(y + u ) du, 

C(y) = f(x 0 ) 1/2 [ K"(u)W*(y + u)du, 

(3.1) J 

v(y) = Cu 3/2 £(y) + Coyf"{x 0 ), 

y*(y) = Co 3/2 {C(y) + Civ)} + c 0 yf"(x 0 ), 

each stochastic process being defined for —oo<y<oo. Note that when K 
is the Gaussian kernel, each process is infinitely differentiable with prob¬ 
ability 1. Let N and N* denote the numbers of downcrossings of 0 by y 
and rf , respectively. Both random variables are well defined and finite with 
probability 1 and take only strictly positive integer values. We shall note in 
Theorem 3.1 that, under regularity conditions, the limiting distribution of 
N(n) is the distribution of N, and the limit of the bootstrap distribution of 
N*(n) may be expressed as the distribution of N* conditional on W. 

Assume / has two continuous derivatives on its support, which we take to 
equal S = [a, b] where —oo < a <b < oo, and that /(a) = f(b) = 0, f'{a+) > 0 
and f'(b—) < 0. Call this condition (C/i). Suppose too that in the inte¬ 
rior of S the equation f'{x) = 0 has a unique solution xq G (a, b), and that 
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f"(x o) < 0; call this (C/2)- Assume of the bandwidth that for some 5 > 0, 
h = h(n) = 0(n ~ s ) as n —> oo, and that n l ^h is bounded away from 0; call 
this condition (C h). For simplicity, and since the Gaussian kernel is by far the 
most commonly used in density estimation problems associated with shape, 
we shall suppose throughout that K(u) = (27r) -1 / 2 exp(— u 2 /2). However, 
since monotonicity of the number of modes of / as a function of h is not a 
concern in our work, the majority of our results hold for sufficiently smooth, 
unimodal, compactly supported kernels such as the triweight. In such cases 
the inequalities f > 0 and f < 0 in the third probability at (3.3) should be 
replaced by nonsharp inequalities. 

Let xq denote the point at which / achieves its largest local maximum. 
Then xq is well defined with probability 1. 


Theorem 3.1 (Bootstrap approximation to distribution of N ). As¬ 
sume (C/i), (C/2) and (Ch), and that K is the Gaussian kernel. 

(a) If in addition to (C h) we have n x '^h —> Co € (0,oo), then 

sup \P{N(n) = k} - P(N = k)\ -c 0 

fc >0 

as n —> 00. While this result continues to hold if N (n) and N are replaced 
by N*(n) and N*, respectively, the bootstrap distribution of N*(n) does not 
converge in the usual sense. Indeed, there exists a construction of (W, W*) 
that depends on X and is such that 

(3.2) sup \P{N*(n) = k\X} - P{N* = k\W)\ -f 0 

fc >0 

in probability as n —> 00. 

(b) If, on the other hand n l ^h —> 00, then both P{N(n) = 1} and P{N*(n ) 
1} converge to 1, and so with probability converging to 1 both f and f* are 
unimodal. Furthermore, if n^ 1 ^ 5 ^ s h —> 00 for some 6 > 0, then each of the 
probabilities 


(3.3) 
equals 1 


P{N(n) = 1}, P{N*(n) = 1} and 

P{f > 0 on (—00, x 0 ) and f < 0 on (aio, 00)} 
0(n~ x ) for all A > 0. 


The first portions of parts (a) and (b) of Theorem 3.1, relating only to 
the nonbootstrap case, are given by Mammen (1995). See also Mammen, 
Marron and Fisher (1992) and Konakov and Mammen (1998). 

It will follow from our proof of (3.2) that the particular construction 
of W, given the data, does not converge, and in particular that P(N* = 
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k\W) does not converge in probability as n —> oo. Therefore, when Co < oo 
the distribution of N*(n), conditional on the data, does not converge in 
probability as n —> oo. Furthermore, part (a) of the theorem implies that 
while the unconditional distribution of N*(n) does converge, it does not 
converge to the limiting distribution of N(n). Theorem 3.1 has several more 
general or more detailed forms, which are given in a longer version of this 
paper obtainable from the authors. 

Next we show that subsampling fails to remove the inconsistency prob¬ 
lems suffered by the bootstrap in the present setting. In fact the bootstrap 
distribution of N*(n), in the case of subsampling, is well approximated by 
a crude indicator function of N(n). Nevertheless, subsampling does not, 
to first order, impair consistency when the bandwidth is of larger order 
than re -1 / 5 . These properties are stated formally in Theorem 3.2. By way 
of notation, we redefine X* = X*(m) to be a resample of size m <n drawn 
by sampling randomly, with replacement, from X, and construct f* and N* 
for this version of X*. 

Theorem 3.2 (Subsample bootstrap approximation to distribution of 
N). Assume (C/i), (C/2) and (C/,), that K is the Gaussian kernel, and 
that the resample size m = min ) satisfies m —oo and m/n —* 0 as n —>■ oo. 

(a) If in addition to (C/J we have re 1//5 /i—► Co £ (0,oo), then 

sup | P{N*{n) = k\X} - I{N(n) = k} | -> 0 

fc >0 

in probability as n —> oo. 

(b) If on the other hand n l ^h —> oo, then both P{N(n) = 1} and P{N*(n ) = 1} 
converge to 1. 

3.2. Shape probabilities for densities with small modes. Theorem 3.1 has 
analogues in the case of densities with one or more shoulders, that is, points 
at which f and f" both vanish. In this case the critical size of the bandwidth 
is n -1 / 7 , rather than n^ 1 / 5 , provided f" does not also vanish at the shoulder. 

In particular, if h is of smaller order than n^ 1 / 7 then the probability 
that the number of modes of / exceeds any fixed integer converges to 1 as 
n —> oo, and if nf^'h —> Co > 0 then the distribution of the number of modes 
has a proper limit, degenerate at the atom v if / has just v modes and 
Co = oo. For the latter result it is sufficient to assume (Cji), along with 
the condition (C^) that in the interior of the support of / the equation 
fix) = 0 has just uj, say, solutions, at just 2n — 1 of which f" 0, with f" 
having three continuous, nonvanishing derivatives in the neighborhoods of 
the other u — 2u + 1 zeros of f. Constraint (C/4) implies that / has v local 
maxima, u — \ local minima and cu — 2v + 1 shoulders. 
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Shoulders may be regarded as embryonic modes, and for this reason den¬ 
sities with shoulders are of particular interest since they lie on boundaries 
separating classes of densities with different shapes, expressed through their 
“modalities.” See, for example, Cheng and Hall (1999). In both theoretical 
and numerical studies the performances of methods for assigning probabil¬ 
ities to density shapes may be assessed in terms of their success in distin¬ 
guishing between densities that have shoulders and those which have small 
modes in places that would otherwise be shoulders. With this in mind we 
shall expand the class of densities satisfying (Cji) and (Cp) by allowing 
the first and second derivatives, but not the first, second and third, to van¬ 
ish simultaneously. We shall discuss the performance, uniformly over such 
densities, of empirical methods for assigning probabilities to the numbers of 
modes and show that techniques based on counting the number of modes 
of a kernel density estimator can have optimal performance, in a minimax 
sense, if bandwidth is chosen larger than n -1// '. 

To simplify discussion we shall base our lower bound on perturbations of 
a density / with just one mode and one shoulder. Specifically, / will sat¬ 
isfy (C/i) and the following condition, which we call (C/ 5 ): In the interior 
of S the equation f'(x) = 0 has just two solutions, xq,xi G (a,b), with xq 
denoting the mode of / and satisfying f"(x 0 ) < 0 , and x\ representing a 
shoulder and such that / has three continuous derivatives in a neighbor¬ 
hood of /, f"(x 1 ) = 0 and f'"(x 1 ) 7 ^ 0. Given any empirical procedure TV" for 
counting the number of modes of a density, we would want TV" to equal 1, 
with high probability, when applied to a dataset drawn from a distribution 
whose density satisfies (Cyi) and (C 75 ). 

Now perturb / by adding a small bump at the shoulder, as follows. Let 
if} denote a symmetric, compactly supported probability density with three 
continuous derivatives on the real line, a unique mode at the origin satisfy¬ 
ing i/j"( 0 ) < 0 , no other point x in the interior of the support of if such that 
ip f (x) = 0 and such that the equation \\f"{xi)\y 2 = \ip'{y)\ has a unique 
solution on (0, 00 ) which also satisfies f'"(x\)y + ip"(y) 7 ^ 0 . Call this con¬ 
dition (G^); as we shall show in the proof of Theorem 3.3, the last part of 
(C^) ensures that the added bump produces a single additional mode. Let 
hi = hi(n) —> 0 , and let the perturbed density be 

t _ f ( x ) + h\ij){{x - xi)/hi} 

Jn[X) ~ 1 + hf 

In this formula, the factor h\ ensures that like /, f n has three bounded 
derivatives in a neighborhood of x\. The denominator 1 + hf guarantees 
that f n integrates to 1 . We would want TV" to equal 2, with high probability, 
when applied to data from the distribution with density f n . 

The density f n has just two modes, at yo = xo + o(hi) and y\ = x\ + o(/ii), 
respectively. Of course, a local minimum occurs between them [at a point 
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with formula x± + 0(h\)], but no other turning points and no shoulders 
exist. Choosing h\ larger or smaller makes the small bump near x\ more or 
less pronounced, respectively. 

Our next theorem shows that in order for it to be possible to correctly 
distinguish two modes in the density f n , based on a sample of size n, the 
rate at which h\ converges to 0 must be strictly slower than . 

Theorem 3.3 (Necessity of using large bandwidth when counting modes). 
Assume f satisfies (Cfi) and (C/ 5 ), and that if satisfies (C^,). LetM denote 
any empirical procedure for counting the number of modes of a density, and 
use it to estimate the number of modes of f and of f n , based on samples 
of size n from these respective distributions. If A f is asymptotically correct 
in each case, that is, if both Pf(M = 1) —* 1 and Pj n (Af = 2) —>■ l, then 
n x ^‘h\ —* 00 as n—> 00. 

It is likewise possible to show that if nf^h\ —* 00 then, provided the 
bandwidth h in the kernel density estimator / converges to 0 at a rate that 
lies strictly between n ~ 1//7 and the rate at which h± decreases, the naive rule 
A/" that simply counts the number of modes of / is asymptotically correct. 
In this sense it achieves the level of precision that is shown by Theorem 3.3 
to be optimal. See Mamrnen (1995) for discussion and details. Konakov and 
Marnmen (1998) treat the multivariate version of this result. 

3.3. Probability distribution of excess mass. In Section 2.2 we defined 
the excess mass, A m , of /, and discussed potential applications of approxi¬ 
mations to the distribution of the empirical form, A m , of this quantity. Here 
we describe the limiting distribution of A m in the case m = 2 . 

Given intervals Li,L 2 as at (2.2), let Lqj denote the version of Lj that 
produces the second supremum there in the case m = 2. Assume / = F' is 
bimodal, let Ao denote the value of A that maximises £2 (A) — £i(A), and 
write L01 = (ad,22) an d -^02 = (23,24), where without loss of generality, 
21 < • ■ ■ < 24 . In this notation, 

£2(A 0 ) = {£(22) - £(21) - A 0 (22 - 21)} 

(3.4) 

+ {£(24) - £(23) - A 0 (24 - 23)}. 

Note that f(xi) = Ao for 1 < i < 4. 

We shall suppose too that one mode contains strictly less excess mass 
than the other, in the sense that the mode from which mass is removed, and 
placed into the trough between the modes when the nearest unimodal density 
is constructed, is uniquely defined. We shall call this mode the “smallest 
mode.” Without loss of generality the smallest mode is the second of the 
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two modes, lying between X 3 and X 4 . See Figure 5 of Muller and Sawitzki 
(1991) for an illustration of this case. 

Next we define the limiting distribution of A 2 ; it is a mixture of correlated 
Normals. Let N = (IV 2 , IV 3 , IV 4 ) denote a trivariate Normally distributed vec¬ 
tor with zero mean and covariances given by var (Ni,Nj) = F(xi){l — F(xj)} 
for i <j; redefine £1 and £2 to be standard Brownian motions, stochastically 
independent of N; and define I to equal 1 if 


supj&N 

sup{^4 (u) 


u 2 } 


u 2 } 


< 


f(x 2 ) 


f'{x 4 ) 


1/3 


and to equal 2 otherwise. Finally, put Z = N 2 i — N 3 . 

We assume of / that it has a continuous derivative, ultimately monotone 
in each tail, that the constraints f'{x) = 0 and f(x) 0 are jointly satisfied 
at just three points, < x^ < x®, in the neighborhood of each of which 
f" exists and is continuous, and f”(x^) < 0, f"(x^>) > 0 and f"(x < 0, 
and the points xi,...,X 4 at (3.4) are such that each f'(xi ) 0. Call this 
condition (Cjq). Let Ag denote the version of A 2 computed not from X 
but from X*, the latter obtained by sampling randomly, with replacement, 
from X. 


Theorem 3.4 (Consistency of bootstrap estimate of excess mass distri¬ 
bution). Assume f satisfies (C/6)- Then the distribution of n 1 / 2 (A 2 — A 2 ) 
converges, as n—+ 00, to the distribution of Z. Furthermore, the conditional 
distribution of n 1 / 2 ( A 2 — A 2 ), given X, converges in probability to the dis¬ 
tribution of Z. 

It follows from Theorem 3.4, and symmetry of the distribution of Z, that 
both the standard percentile bootstrap methods consistently estimate quan¬ 
tiles of the distribution of A 2 . Therefore, the percentile bootstrap produces 
confidence intervals for the excess mass A 2 that have asymptotically correct 
coverage accuracy. For instance, if 0 < a < 1 and t a is defined to be the 
infimum of values t such that P( A?j < t\X) > a, then it follows from the the¬ 
orem that P(A 2 < t a ) —* a as n —> 00. The percentile bootstrap technique 
was used in Section 2.2 to construct confidence regions for A 2 and hence to 
compute estimates of / whose shapes (in terms of their “modal sharpness” 
or “delineation,” as expressed through excess mass) correspond to particular 
quantiles. 

Theorem 3.4 is readily extended to show that, under regularity condi¬ 
tions analogous to and for general m > 2, the limiting distribution 

of n 1 / 2 (A m — A m ) is consistently approximated by the conditional distribu¬ 
tion n 1//2 (A* n — A m ). Of course, such a result fails if, when computing the 
bootstrap approximation, we mistakenly constrain the initial estimator fz 
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to have too few modes. (See Section 2.3 for discussion of fz-) In particular, 
if we constrain fz to have M < m o modes, where mo denotes the true num¬ 
ber of modes of /, then fz converges not to / but to an M-mode density 
that is nearest to / in a sense that can be defined in terms of the distance 
measure, d, used for the data-sharpening algorithm. On this occasion, this 
basic inconsistency renders invalid any bootstrap approximations that start 
from fz- 


4. Proofs of Theorems 3.1 and 3.2. The nonbootstrap parts of Theo¬ 
rem 3.1 are given by Mammen (1995), but since parts of his argument are 
needed for the bootstrap case and for Theorem 3.2, they are reproduced in 
outline form here. 


4.1. Monotonicity of f outside (xq — £,xo + e). Define D\(x) = /'(x) — 
E{f'(x)} and t = logn. The argument used to derive Lemma 6 of Mammen, 
Marron and Fisher (1992) [see also Silverman (1983)] may be employed to 
prove that for each A > 0 there exists B = B{ A) > 0 such that 

p{ sup \£>i(x)\ > Bii/nh 3 ) 1 / 2 } =0{n~ x ). 

\.a<x<b J 


Note too that E{f(x )} = f(x) + o(l) uniformly in x € [a + 6 , b — J] for each 
5 > 0, while E{f'(x)} > |/'(x) + o( 1) uniformly in x € [a,xo] and E{f'(x)} < 
\f'(x ) +o(l) uniformly in x € [xo,&]. H follows from these properties that 
for each e € (0, min(x'o — a,b — xo)), and all A > 0, 

P{f' > 0 on [a, xq — e] and f'< 0 on [xq + £,b]} = 0(n~ x ). 


The definition of / implies directly that with probability 1, f > 0 on (—oo, o) 
and f < 0 on (6, oo). Hence, for each e G (0,min(xo — a, b — xq)), and all 


A > 0, 


(4.1) 


P{f > 0 on (—oo, xo — e] and f'< 0 on = [xo + £, oo)} 
= 1 — 0 (n~ x ). 


Similarly, (4.1) holds if f 1 is replaced by (/*) / . 


4.2. Approximation to f and (/*)' on (xo — £,xq + e). Define D\(x) = 

(r)'(x) - /'(x), 

^i(x) = J K”(u)Wq{F(x + hu)} du, 

£*(x) = J K"(u)Wq{F(x + hu)}du, 
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where Wo, and Wq conditional on both X and Wo, are standard Brownian 
bridges, and F denotes the conventional empirical distribution function of 
the sample X from which / was computed. It may be proved, using the 
embedding of Kornlos, Major and Tusnady (1976), that Wo and Wq may be 
constructed such that 


(4.2) 


D 1 (x)=n l / 2 h 2 £i(x) + Ri(x), 

£>l(x) = n-V 2 h- 2 £(x) + R\(x), 
where for each 5, A > 0 and e € (0, min(x'o — a,b — xq)), 


(4.3) 


pj sup |i?i(x)|>n 1+s h 2 | = 0 (n A ), 

f|a;— xo\<£ 1 

pj sup \Rl(x)\>n- 1 +S h~ 2 \ = 0(n~ x ). 

via;— xn\<e ) 


4.3. Monotonicity of f and (f*)' outside (xq — Ch,xo + Ch). Note that 
E{f'(x)} = f (x) + o(h) uniformly in x £ (xo — e, xo + £ ), for sufficiently small 
e > 0, and that 

(4.4) sup | E{f'(x 0 + hy )} - hyf"(x 0 )| = o(h) 

\y\<c 

for any C > 0. It may be deduced from these results, the fact that h is not 
less than a constant multiple of ttT 1 / 5 , and properties of a Brownian bridge, 
that for each C\, 6 >0 there exists C > 0 such that for all sufficiently small 
£ > 0, and all sufficiently large n, 

P{n -1//2 /zT 2 £i(x) + E{f'(x)} is greater than C\h 


(4.5) for —£<x — xo<—Ch, 


and is less than —C\h for Ch < x — xq < e} > 1 — 5. 

Similarly we may prove that (4.5) holds if we replace £i by +^|. If 

for some 6 > 0 then both results may be strengthened by replacing “> 1 — S” 

on the right-hand side of (4.5) by “= 1 — 0(n ~ x ) for all A > 0.” 

Combining (4.1)-(4.3) and (4.5) we deduce that for each 5 > 0 there exists 
C > 0 such that for all sufficiently large n, 


(4.6) 


P{f'(x ) is strictly positive for x < xo — Ch 

and strictly negative for x > xq + Ch} > 1 — 5, 


and that (4.6) continues to hold if f is replaced by (/*)'. Moreover, both 
results continue to hold with “> 1 — 5” on the right-hand side of (4.6) re¬ 
placed by “= 1 — 0(n ~ x ) for all A > 0,” provided n^' 5 ^~ s h —* oo for some 


<5 >0. 
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4.4. Approximation to f and (/*)' on (xo — Ch,xo + Ch). Define D(y ) = 
D\ (xo + hy) and D*(y) = D\ (xo + hy ). Noting the continuity properties of a 
Brownian bridge and that for each 5, X > 0 the probability that sup \F — F\ 
exceeds n 5- ! 1 / 2 ) equals 0(n _A ), we may deduce that, defining 

e 2 {y) = I K"(u)Wq {F(xo + hy + hu)}du , 
it is true that for each C, 5, A > 0, 


P sup |£(a;o + hy) - £ 2 *(y)| > n s ~^ = 0(n“ A ). 

t\y\<C J 

From this result, (4.2), (4.3) and the fact that h is no smaller than a constant 
multiple of n -1 / 5 , we may deduce that for each C, A > 0 and some 8 > 0, 

(4.7) p\ sup | D*{y) - n- 1 / 2 h~ 2 &{y)\ > n -B/ 2 M h~ 3 / 2 ) = 0(n~ x ). 

Ay\<C J 

Note that we may write Wo(t) = V{t) — tV(t), where V is a standard 
Brownian motion, and that Wft may be represented analogously. These prop¬ 
erties, and arguments similar to those in the previous paragraph, allow us 
to show that if we define £ and £* as at (3.1), for appropriate choices of W 
and W*, then for some 8 > 0 we have for each C, A > 0, 


P 


sup {My) - h^iy)] + My) - h}/ 2 e{y)\} > n-M ' 2 

\y\<c 


= 0 (n ~ x ) 


From this result, (4.2), (4.3) and (4.7) we may deduce that 


(4.8) 


D(y) = (nh 3 ) 1 / 2 {£(y) + R(y)}, 
D*(y) = (nh 3 )- 1 / 2 {C(y) + R*(y)}, 


where for some 8 > 0 and each C, A > 0, 


(4.9) 


P 


sup {\R(y)\ + \R*(y)\}>n 5 

Uv\<c 


Oin~ x ). 


In the notation of (4.8), 

f'{x o + hy) = ( nh 3 )~ 1/2 My ) + R(y)} + E{f'(x 0 + hy)}, 
(4.10) (f )'(x o + hy) = (uk 3 )- 1 ' 2 }^) + £*(y) + R(y) + R*(y)} 

+ E{f'(x 0 + hy)}. 
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4.5. Proof of Theorem 3.1(a). Define y and y* , in terms of £ and £*, as 
at (3.1). It follows from (4.4), (4.9) and (4.10) that 

C (y) = n 1/5 f'(x o + hy) = y(y) + o p ( 1), 

C*(y) = n 1/5 (f*Y(x o + hy) = rf(y) + o p ( 1 ), 

both results holding uniformly in \y\ < C. The processes rj, y*, f and (* 
are all differentiable, each derivative equals O p ( 1 ) uniformly on [—C,C], 
and (' = rj + o p (l) and (£*)' = (y*)' + o p (l) uniformly on [—C , C], for each 
C > 0. Furthermore, if the downcrossings of 0 by y on [— C, C\ occur at points 
Z\,... , Zm, where M < N, then (a) P(M = N) —> 1 as C —> oo, (b) with 
probability 1 no Z* equals C or —C, and (c) for each e > 0 , 

lim P{\y'(Zi)\ > e for each i} = 1. 


Together these properties imply that for each C > 0, 

P{the number of downcrossings of 0 by f 


(4.11) 


on (xq — Ch,XQ + Ch) equals 

the number of downcrossings of 0 by y 


on (xq — Ch ,xq + Ch)} —> 1 


as n—» oo. Similarly, (4.11) holds if (/',? 7 ) is replaced by (( f*)',V*)■ Theo¬ 
rem 3.1(a) follows from (4.6), (4.11) and their bootstrap forms. 


4.6. Proof of Theorem 3.1(b). Minor modifications of the previous argu¬ 
ments show that when nf^h —► 00 , P{N(n) = 1} and P{N*(n) = 1} both 
converge to 1. Next we prove that when nC/^~^h —> 00 for some 5 > 0, we 
have for each A > 0, 

(4.12) P{N(n) = 1} = 1 — 0(n~ x ). 

Similar arguments may be used to obtain the same identity for the other 
two probabilities at (3.3). 

Observe from (4.6) and the comments which immediately follow it that 

(4.12) will follow if we show that for each C > 0, 

P{f' has at most one zero in (x$ — Ch,xo + Ch)} = 1 — 0(n -A ) 
for each A > 0. This result is in turn implied by: for each A > 0, 

(4.13) P{f" has no zeros in (xo — Ch ,xo + Ch)} = 1 — 0(n~ A ). 

We may establish an analogue, for f", of the first parts of (4.9) and (4.10), 
f"{x 0 + hy) = (n/i 5 )~ i/ 2 {£ , ( 2 /) + s (v)} + E {f"(x 0 + hy)}, 
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where, for some e > 0 and each C, A > 0, 

(4.14) pi sup \S(y)\ >n~ £ \ =0(n~ x ). 

Mwl<C ) 

The condition —> oo, which we are currently assuming, implies that 

(n/i 5 )~ 1//2 = 0(n ~ £ ) for some e > 0. From this result, (4.14) and properties 
of a Brownian motion, we may deduce that for each e > 0 and all C, A > 0, 

p\{nh 5 )~ 1/2 sup |^'(y) TS 1 ^)! >e) =0(n" A ). 

L \y\<c J 

It follows from this property, the fact that f"( 0) < 0 and the expansion 
E{f"(x o + hy )} = f"(x o) + o(l) uniformly in \y\ < C, that the probability 
that f" < 0 throughout (xo — Ch,xo + Ch) equals 1 — 0{n ~ x ) for all C, A > 0. 
This implies (4.13). 

4.7. Outline proof of Theorem 3.2. Result (4.6), and the properties noted 
immediately below it, continue to be valid in the present case. And (4.7) 
holds in the following form, for the same definition of as before: for each 
C, A > 0 and some 5 > 0, 

pf sup | D*{y) - n- 1 m 1 / 2 h~ 2 ^{y)\> n-^- 5 h^ 2 \ =0{n~ x ). 

Mwl<c J 

Thus, in place of (4.9) and the second part of (4.10) we may write 
(/*)'(* o + hy) = {nh?)~ l/2 {i{y) + o p (l)} + hyf"{y) 

= f'{x o + hy) + o p {(nh 3 )~ 1/2 }, 

uniformly in \y\ < C. The argument in Section 4.4 may now be used to show 
that the probability, conditional on T, that the number of downcrossings 
of 0 by ( f*Y equals the number of downcrossings of 0 by q. converges to 
1 as n —> oo. Likewise, the unconditional probability that the number of 
downcrossings of 0 by rj equals the number of downcrossings of 0 by f 
converges to 1. Part (a) of Theorem 3.2 follows from these properties, and 
part (b) may be derived similarly. 

5. Proofs of Theorems 3.3 and 3.4. 

5.1. Proof of Theorem 3.3. First we show that, under condition (C^,), 
the density f n has just two modes, one local minimum and no shoulders on 
its support, for all sufficiently small h \; call this property (P). It will follow 
that M is asymptotically correct if it concludes (with probability converging 
to 1 as n —> oo) that / and f n have just one and two modes, respectively. 
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In view of the definition of f n , any turning point of f n on (a, b) that is 
not identical to xq must converge to x\ as n —» oo. Assume without loss of 
generality that f"'(x i) > 0, and note that 

(1 + h\)fn( x l + hiy) = f(x i + hiy) + h\ij/(y) 

= tii{\f"\x 1 ) y ‘ 2 + ^{y) + o( 1)} 

as h\ —> oo, uniformly in \y\ < C for any C > 0. These formulas, and the 
assumption [part of (C^)] that the equation 

(5-1) \f"'{x 1 )y 2 = \^\y)\ 

has a unique solution yo in (0,oo), imply that any turning point y\ of f n 
that converges to x\ as n —> oo, and is not identical to x\, must satisfy 


(5.2) 


y 1 = x 1 + hiy 0 + o(hi), 


where y\ is the solution of (5.1). Moreover, since / w (xi)yo + ip"{yo ) A 0, 
again by virtue of (C^), then the equation f'(x\ + hiy) + h\ij)'{y) = 0 can 
have no more than one solution y\ satisfying (5.2). It follows that y\ must 
represent a unique local minimum between xq and x\ , and that (P) holds. 

We may view TV" as a rule for discriminating between / and f n , determin¬ 
ing that the n-sample from which TV" is computed comes from f n if TV" = 2 
and comes from / otherwise. If P/(TV" = 1) —► 1 and Pf n (J\f = 2) —> 1, then 
TV" provides asymptotically perfect discrimination, and so, by the Neyman- 
Pearson lemma, the likelihood ratio rule also provides perfect discrimination. 
It suffices to show that the latter property implies n 1 / ‘hi —> oo. 

We shall argue by contradiction and show that if n l ^h is bounded as 
n —► oo through some infinite sequence, A say, then the likelihood ratio 
rule does not provide asymptotically perfect discrimination along the se¬ 
quence. It may be assumed without loss of generality that nh —s- oo as n — » oo 
through A, since otherwise a simple subsidiary argument produces a con¬ 
tradiction. 

Observe that, in view of the compact support of i/i, 


(1 + h\) 


fn(x ) 

fix) 


= 1 + h\ 


fix. 


■V’ 


X — X\ 
hi 


uniformly in x £ (a, b), as n —> oo through values in A. The log-likelihood 
ratio is therefore 


= X>g{/n(*i)//(*i)} 

i= 1 




fiXi) 




Xi - x\ 
hi 


-hi 
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2 

= ( nh[) l / 2 aZ — in/ijcr 2 + o p (l), 


where o -2 = (/ i/f 2 )/f(x i), the random variable Z is asymptotically standard 
normal, and the remainders are of the stated orders as n —» oo through Al. 
Therefore LR = O p (l) as n —> oo through values in A, and so it is not possible 
for the likelihood-ratio test to discriminate, with asymptotic probability 1, 
against f n for data from / as n —* oo through A. 


5.2. Outline proof of Theorem 3.4. We shall derive only the first, un¬ 
conditional limit theorem; the second, conditional bootstrap result may be 
proved similarly. At a key point in the latter proof, where the bootstrap form 
Wq of a Brownian bridge is used in the form of a function of the empirical 
distribution function F (cf. Section 4.2), we may replace ?tA 1//2 IUq (F) by 
u~ 1 ^ 2 Wq{F) and incur an error of only O p (n~ 3 / 4 log n). This is of smaller 
order than the error of size n~ 2 / 3 that arises if the approximation is sub¬ 
sequently pursued using arguments developed below, in the nonbootstrap 
case. In this way it can be seen that the “in distribution” limits are identical 
in the two cases. 

Using the embedding of Komlos, Major and Tusnady (1976) we may, for 
each n, construct a standard Brownian bridge Wo such that 

F(x) = F(x) + n~ 1 / 2 W 0 {F{x)} + O p {n~ l l ), 

uniformly in x, where i = logn. Of course, Wo(t) = B(t) — tB( 1) for a 
standard Brownian motion B. Put = rW 1 / 3 , write yi = X{ + rjm where 
supj \ui\ <C for some fixed C > 0, and define Af = Wo{F(xi)} and 

Wi(t) = (A 0 r,)- 1 l 2 [B{F{x i ) + Xovt} ~ Ni\. 

Then W{ is a standard Brownian motion, and F(yi) = F(xi) + Xo'r]Ui + 0(r] 2 ). 
Therefore, using properties of the modulus of continuity of B, we deduce that 

(5.3) F{ yi ) - F( yi ) = rT^Ni + (A 0 y/nf^Wifuf) + O^i/n 1 / 2 ) 

uniformly in |itj| <C. 

Put 5i = r)u t and define A to denote the operator describing the pertur¬ 
bation arising when Xi is changed to yi = Xi + 5i, for 1 < i < 4, small \S t \ and 
A = Ao held fixed. For example, A iF(xi) = F(xi + §i ) — F{xf). Then, since 
each f(xi) equals Ao, A {F(xi) — Ao®i} = \5 2 f'(xi ) + o(t/ 2 ). From this result 
and (5.3) we deduce that if A = Ao + rfv then 

F{y%) - F{yj) - Kvi - yj) 
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= F(xi) - F(xj ) - A 0 (xi - Xj) + \r) 2 {f'{xi)u 2 - f'(xj)u 2 } 

+ n-V 2 (Ni - Nj) + (Ao v/n) 1 / 2 {Wi(ui) - Wj( Uj )} 

- rj 2 v(xi - Xj) + Op^lrC 1 / 2 + r/ 3 ), 

uniformly in |itj|, |u| <C. Equivalently, if we define the intervals L = ( Uj,yi ) 
and Lq = ( Xj,Xi ) then 

F(L)-\\\L\\-{F(L 0 )-\ 0 \\L 0 \\} 

= n- 1 ' 2 (N i -N j ) 

+ V 2 iK / 2 {Wi(ui) - Wj{uj)} + \{f\ Xi )u 2 - f\xj)u 2 } - v(xi - Xj)] 
+ O p (r/£n -1 / 2 ). 

Therefore, if T (1) = ( 221 , 2 / 2 ), £ (2) = ( 2 / 3 , 2 / 4 ), = (xi,x 2 ) and Tq 2) = 

(x 3 ,x 4 ) then 

£{F(L«)-A||L«||} 

Z— 1 

= E 2 (A 0 ) + n" 1/2 (iV 2 + iV 4 - JVi - JV 3 ) 

+ r ? 2 [Ao /2 {lT 2 (u 2 ) + W 4 M - WiK) - W 3 (u 3 )} 

+ + f'{x^)u\ - f'{xi)ul - f'(x 3 )u\} 

- u(x 2 + x 4 - Xi - x 3 )] + o p (r] 2 ). 

Taking the supremum over u\,... ,u 4 , and noting that C in the bound \m\ < 
C is an arbitrary although fixed number, we deduce that 

E 2 ( A) = E 2 (A 0 ) + n~ 1 ^ 2 (N 2 + iV 4 - Ad - iV 3 ) 

4 

(5.4) +r/ 2 Ao /2 ^sup{Sj(u)-6jU 2 } 

, U 
1=1 

- T] 2 v(x 2 +X4-X1- x 3 ) + Op(? 7 2 ), 

where B{(u ) = (—l)*Wi(u) is a standard Brownian motion process, and 6* = 
(—1 y + 1 f'(xi) > 0. Strictly speaking, E 2 ( A) on the left-hand side and the 
supremum on the right-hand side are defined with the suprema taken only 
over |itj| < C. However a subsidiary argument shows that (5.4) holds when 
the suprema are interpreted over the whole real line. 

Similarly, 

Ei( A) = E 1 (A 0 )+n- 1/2 (^ 2J - Ad) 
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(5.5) 


+ r? A 


2x1/2 


sup{B 2 j(u) - b 2 J u 2 } + sup{Bi(u) - biu 2 } 

U U 


~ rj 2 v(x 2 j - xi) + o p (rj 2 ), 
where J = 1 or 2 according as 

sup{B 2 (u) — b 2 u 2 } > sup{-£? 4 (u) — b^u 2 } 

U U 


is true or false. Subtracting (5.5) from (5.4), taking the supremum over v 
and writing I = 3 — J, we deduce that 

(5.6) A 2 = A 2 + n~ 1 ^ 2 (N 2I - N 3 ) + O p fq 2 ). 


[Much as in the cases of (5.4) and (5.5), a subsidiary argument shows that 
the suprema over v may be taken over the whole positive real line, not just 
over |u| < C.] 

i/o o/o 

We may write b t ' Bi(u) = £i(t) where f = 6-' u and ^ is, like Bi, a stan¬ 
dard Brownian motion. In this notation, Bi(u ) — biU 2 = b j ^{^(t) — t 2 }, 
and so I = 1 or 2 accordingly as 


sup{g 2 (u) 

sup {U(u) 


u 2 } 


u 2 } 


< 


/'(* 2) 


f'{x 4) 


1/3 


is true or false, respectively. The variables N 2 , IV 3 , and N 4 have a joint 
Normal distribution with zero mean and covariances given by var (Ni,Nj) = 
F(x{){ 1 — F(xj)} for i < j. Furthermore, the iVj’s are asymptotically in¬ 
dependent of the processes The theorem follows from these properties 
and (5.6). 


APPENDIX A 

Description of data sharpening for constraining excess mass. The method 
is based on a density estimator, which we shall denote by /. This can be 
either a conventional estimator, /, computed from X, and which we could 
denote by fx to indicate that fact or an estimator computed after X has 
been sharpened to Z = {Z±,Z n }, say. In this case we denote the estima¬ 
tor by fz- The dataset Z might be chosen so that fz has a given number 
of modes. See the next paragraph for further discussion. Of course, the case 
f = fz subsumes / = fx as a special, degenerate case, so we may take f = fz 
below. 

If one of our aims is to ensure that fz has just m (say) modes, where 
m is different from the number of modes of fx, then we might proceed 
as follows. First, choose a bandwidth for the density estimator (usually by 
employing a standard method applied to the original dataset), and let d(-, ■) 
denote a nonnegative measure of distance on the real line. It need not be a 
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metric, but for ease of interpretation it should be symmetric. For example, 
d(x, y) = (x — y ) 2 is a possibility. Put d(X, Z ) = d(Xi,Zi ), and choose Z 
to minimise d(X,Z) subject to the constraint that fz just has m modes. 
[See Hall and Kang (2002) for discussion.] 

The mth mode will in fact be a shoulder, but can be made more pro¬ 
nounced (once the shoulder is achieved) by transferring the constraint to 
one on excess mass, rather than on the number of modes. Specifically, once 
the estimator fz with just m modes is attained, sharpen Z to T by mini¬ 
mizing d(Z, y) subject to fy having an increased value of excess mass; that 
is, A m(fy) = A, where A > A m (fz) would typically be chosen to be an 
estimator of a quantile of the distribution of A m (f). 

In each case the constraints may be imposed using methods based on 
simulated annealing; this approach is elementary in terms of code, although 
lengthy from the viewpoint of computing time. The algorithm is described 
in Appendix B. 

For example, if we wished to use m = 3 in the algorithm discussed above, 
but the estimator fx had only one mode, the first step would generally 
be to sharpen X to Z so that fz had three modes. Nevertheless, although 
the value of m could be greater or less than the actual number of modes 
of fx, usually it would be less than that number, reflecting the fact that 
standard kernel density estimators (with appropriately chosen bandwidths) 
tend to have more, not fewer, modes than the true density. Note too that 
perhaps not all the modes will be substantial, in the sense of excess mass 
(see Section 2.2). 


APPENDIX B 

Algorithm for data sharpening subject to constraints on excess mass. 

Let the “starting” dataset be Z = {Z \,..., Z n }, and denote by Am our 
bootstrap estimator of the ce-level quantile of the excess mass distribution. 
Let the sharpened dataset be y = {Y\ ,... ,Y n } and define the distance be¬ 
tween Z and T to be D(Z,y) = — L}) 2 . We seek y to minimize 

D(Z,y) subject to A m {fy) = A m\ This problem is solved by a standard 
simulated annealing algorithm, the perturbations of which (within the an¬ 
nealing loop) are generated as follows. 

Let / denote the density estimator fz, and write / max for its maximum 
value. At the next step of the algorithm we decide whether we wish to 
make the data less or more “diffuse,” based on whether the current excess- 
mass statistic A m (fy) is less than or greater than the target value A^, 
respectively. For a given data point j/j, where 1 < i < n, we generate a move 
as 

Vi <- Vi + SZi exp {-f{yi)f /max} 


(B.l) 
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if we wish to make the data less diffuse or 

(B.2) Hi < Ui + SZi exp[{/(?/j) /max}//max] 

if we wish to make them more diffuse. Here s is a constant equal to the 
range of Z divided by 1000 (a value which was chosen by trial and error), 
and Zi is a number drawn randomly from the standard Normal distribution. 
Using formulas (B.l) and (B.2) to govern the perturbations was found to 
give better convergence rates than employing a naive perturbation formula. 

The perturbation of y indicated by (B.l) or (B.2), for 1 < i < n, was ig¬ 
nored if it took A m(fy) further from the target value A m\ The algorithm 
was terminated when A m (fy) got within s of A^. We repeated this proce¬ 
dure 100 times and selected as the solution the configuration with the lowest 
value of D(Z,yj). 

In practice, this algorithm always converged. In numerical experiments, 
to check whether the limit was significantly affected by early steps taken by 
the algorithm, we sometimes started it from small perturbations of Z, but 
nevertheless reached the same limit. 

Acknowledgment. We are grateful to three reviewers for helpful com¬ 
ments. 
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